The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.
This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of official sources and hosted on the Our World in Data website (Mathieu et al. 2021).
To run R and RStudio on Binder, click on this badge - .
Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project's folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.
For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we've got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, patchwork to include several plots in a single figure and a few others to assist in getting the data into shape.
To install these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they're not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.
# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
p_load(dplyr, tidyr, ggplot2, readr, paletteer, glue, scales, plotly, lubridate, patchwork, visdat)More information on installing and using R packages can be found in this tutorial.
Now that we've got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).
We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data directly from a file on the the Our World in Data website into a variable of type data frame (table). If we don't want to use external packages, we can use the read.csv() function from base R, which won't automatically parse columns containing dates and in previous versions of R (< 4.0) will slightly change the structure of the resulting data frame (all text columns will be converted into factors).
> Note that in this case, we need to specify the column types because the data contains a lot of missing values that interfere with the automatic parsing of the column types._
# read data from Our World in Data website
covid_data <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv",
col_types = paste(c("c", "f", "c", "D", rep("d", 29),
"c", rep("d", 26)), collapse = ""))Let's use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column:
#explore the data frame
head(covid_data) # show first 10 rows of the data and typr of variables## # A tibble: 6 x 60
## iso_code continent location date total_cases new_cases new_cases_smoot~
## <chr> <fct> <chr> <date> <dbl> <dbl> <dbl>
## 1 AFG Asia Afghani~ 2020-02-24 1 1 NA
## 2 AFG Asia Afghani~ 2020-02-25 1 0 NA
## 3 AFG Asia Afghani~ 2020-02-26 1 0 NA
## 4 AFG Asia Afghani~ 2020-02-27 1 0 NA
## 5 AFG Asia Afghani~ 2020-02-28 1 0 NA
## 6 AFG Asia Afghani~ 2020-02-29 1 0 0.143
## # ... with 53 more variables: total_deaths <dbl>, new_deaths <dbl>,
## # new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## # new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## # total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## # new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## # icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## # hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>,
## # weekly_icu_admissions_per_million <dbl>, weekly_hosp_admissions <dbl>,
## # weekly_hosp_admissions_per_million <dbl>, new_tests <dbl>,
## # total_tests <dbl>, total_tests_per_thousand <dbl>,
## # new_tests_per_thousand <dbl>, new_tests_smoothed <dbl>,
## # new_tests_smoothed_per_thousand <dbl>, positive_rate <dbl>,
## # tests_per_case <dbl>, tests_units <chr>, total_vaccinations <dbl>,
## # people_vaccinated <dbl>, people_fully_vaccinated <dbl>,
## # total_boosters <dbl>, new_vaccinations <dbl>,
## # new_vaccinations_smoothed <dbl>, total_vaccinations_per_hundred <dbl>,
## # people_vaccinated_per_hundred <dbl>,
## # people_fully_vaccinated_per_hundred <dbl>,
## # total_boosters_per_hundred <dbl>,
## # new_vaccinations_smoothed_per_million <dbl>, stringency_index <dbl>,
## # population <dbl>, population_density <dbl>, median_age <dbl>,
## # aged_65_older <dbl>, aged_70_older <dbl>, gdp_per_capita <dbl>,
## # extreme_poverty <dbl>, cardiovasc_death_rate <dbl>,
## # diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## # handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## # life_expectancy <dbl>
str(covid_data) # show data structure## tibble [109,972 x 60] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ iso_code : chr [1:109972] "AFG" "AFG" "AFG" "AFG" ...
## $ continent : Factor w/ 6 levels "Asia","Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ location : chr [1:109972] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ date : Date[1:109972], format: "2020-02-24" "2020-02-25" ...
## $ total_cases : num [1:109972] 1 1 1 1 1 1 1 1 2 4 ...
## $ new_cases : num [1:109972] 1 0 0 0 0 0 0 0 1 2 ...
## $ new_cases_smoothed : num [1:109972] NA NA NA NA NA 0.143 0.143 0 0.143 0.429 ...
## $ total_deaths : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_smoothed : num [1:109972] NA NA NA NA NA 0 0 0 0 0 ...
## $ total_cases_per_million : num [1:109972] 0.026 0.026 0.026 0.026 0.026 0.026 0.026 0.026 0.051 0.103 ...
## $ new_cases_per_million : num [1:109972] 0.026 0 0 0 0 0 0 0 0.026 0.051 ...
## $ new_cases_smoothed_per_million : num [1:109972] NA NA NA NA NA 0.004 0.004 0 0.004 0.011 ...
## $ total_deaths_per_million : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_per_million : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_smoothed_per_million : num [1:109972] NA NA NA NA NA 0 0 0 0 0 ...
## $ reproduction_rate : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients_per_million : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients_per_million : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions_per_million : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions_per_million : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests_per_thousand : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_per_thousand : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed_per_thousand : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ positive_rate : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ tests_per_case : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ tests_units : chr [1:109972] NA NA NA NA ...
## $ total_vaccinations : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ total_vaccinations_per_hundred : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated_per_hundred : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated_per_hundred : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters_per_hundred : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed_per_million: num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ stringency_index : num [1:109972] 8.33 8.33 8.33 8.33 8.33 ...
## $ population : num [1:109972] 38928341 38928341 38928341 38928341 38928341 ...
## $ population_density : num [1:109972] 54.4 54.4 54.4 54.4 54.4 ...
## $ median_age : num [1:109972] 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
## $ aged_65_older : num [1:109972] 2.58 2.58 2.58 2.58 2.58 ...
## $ aged_70_older : num [1:109972] 1.34 1.34 1.34 1.34 1.34 ...
## $ gdp_per_capita : num [1:109972] 1804 1804 1804 1804 1804 ...
## $ extreme_poverty : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ cardiovasc_death_rate : num [1:109972] 597 597 597 597 597 ...
## $ diabetes_prevalence : num [1:109972] 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
## $ female_smokers : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ male_smokers : num [1:109972] NA NA NA NA NA NA NA NA NA NA ...
## $ handwashing_facilities : num [1:109972] 37.7 37.7 37.7 37.7 37.7 ...
## $ hospital_beds_per_thousand : num [1:109972] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ life_expectancy : num [1:109972] 64.8 64.8 64.8 64.8 64.8 ...
## - attr(*, "problems")= tibble [109,972 x 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:109972] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ col : chr [1:109972] NA NA NA NA ...
## ..$ expected: chr [1:109972] "60 columns" "60 columns" "60 columns" "60 columns" ...
## ..$ actual : chr [1:109972] "62 columns" "62 columns" "62 columns" "62 columns" ...
## ..$ file : chr [1:109972] "'https://covid.ourworldindata.org/data/owid-covid-data.csv'" "'https://covid.ourworldindata.org/data/owid-covid-data.csv'" "'https://covid.ourworldindata.org/data/owid-covid-data.csv'" "'https://covid.ourworldindata.org/data/owid-covid-data.csv'" ...
## - attr(*, "spec")=
## .. cols(
## .. iso_code = col_character(),
## .. continent = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. location = col_character(),
## .. date = col_date(format = ""),
## .. total_cases = col_double(),
## .. new_cases = col_double(),
## .. new_cases_smoothed = col_double(),
## .. total_deaths = col_double(),
## .. new_deaths = col_double(),
## .. new_deaths_smoothed = col_double(),
## .. total_cases_per_million = col_double(),
## .. new_cases_per_million = col_double(),
## .. new_cases_smoothed_per_million = col_double(),
## .. total_deaths_per_million = col_double(),
## .. new_deaths_per_million = col_double(),
## .. new_deaths_smoothed_per_million = col_double(),
## .. reproduction_rate = col_double(),
## .. icu_patients = col_double(),
## .. icu_patients_per_million = col_double(),
## .. hosp_patients = col_double(),
## .. hosp_patients_per_million = col_double(),
## .. weekly_icu_admissions = col_double(),
## .. weekly_icu_admissions_per_million = col_double(),
## .. weekly_hosp_admissions = col_double(),
## .. weekly_hosp_admissions_per_million = col_double(),
## .. new_tests = col_double(),
## .. total_tests = col_double(),
## .. total_tests_per_thousand = col_double(),
## .. new_tests_per_thousand = col_double(),
## .. new_tests_smoothed = col_double(),
## .. new_tests_smoothed_per_thousand = col_double(),
## .. positive_rate = col_double(),
## .. tests_per_case = col_double(),
## .. tests_units = col_character(),
## .. total_vaccinations = col_double(),
## .. people_vaccinated = col_double(),
## .. people_fully_vaccinated = col_double(),
## .. total_boosters = col_double(),
## .. new_vaccinations = col_double(),
## .. new_vaccinations_smoothed = col_double(),
## .. total_vaccinations_per_hundred = col_double(),
## .. people_vaccinated_per_hundred = col_double(),
## .. people_fully_vaccinated_per_hundred = col_double(),
## .. total_boosters_per_hundred = col_double(),
## .. new_vaccinations_smoothed_per_million = col_double(),
## .. stringency_index = col_double(),
## .. population = col_double(),
## .. population_density = col_double(),
## .. median_age = col_double(),
## .. aged_65_older = col_double(),
## .. aged_70_older = col_double(),
## .. gdp_per_capita = col_double(),
## .. extreme_poverty = col_double(),
## .. cardiovasc_death_rate = col_double(),
## .. diabetes_prevalence = col_double(),
## .. female_smokers = col_double(),
## .. male_smokers = col_double(),
## .. handwashing_facilities = col_double(),
## .. hospital_beds_per_thousand = col_double(),
## .. life_expectancy = col_double()
## .. )
We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continuous numerical values.
# summary of variables in my data
summary(covid_data)## iso_code continent location
## Length:109972 Asia :25856 Length:109972
## Class :character Europe :25948 Class :character
## Mode :character Africa :28113 Mode :character
## North America:13815
## South America: 6536
## Oceania : 4621
## NA's : 5083
## date total_cases new_cases new_cases_smoothed
## Min. :2020-01-01 Min. : 1 Min. :-74347 Min. : -6223.0
## 1st Qu.:2020-07-18 1st Qu.: 1579 1st Qu.: 2 1st Qu.: 8.1
## Median :2020-12-06 Median : 16608 Median : 81 Median : 100.9
## Mean :2020-11-29 Mean : 1240219 Mean : 6251 Mean : 6259.5
## 3rd Qu.:2021-04-15 3rd Qu.: 175287 3rd Qu.: 864 3rd Qu.: 913.3
## Max. :2021-08-16 Max. :207834266 Max. :905993 Max. :826368.1
## NA's :4745 NA's :4748 NA's :5758
## total_deaths new_deaths new_deaths_smoothed
## Min. : 1 Min. :-1918.0 Min. : -232.143
## 1st Qu.: 61 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 492 Median : 2.0 Median : 1.429
## Mean : 32499 Mean : 145.7 Mean : 131.966
## 3rd Qu.: 4605 3rd Qu.: 18.0 3rd Qu.: 15.000
## Max. :4371025 Max. :18000.0 Max. :14726.429
## NA's :15200 NA's :15045 NA's :5758
## total_cases_per_million new_cases_per_million new_cases_smoothed_per_million
## Min. : 0.0 Min. :-3162.163 Min. :-276.825
## 1st Qu.: 300.7 1st Qu.: 0.247 1st Qu.: 1.437
## Median : 2248.7 Median : 9.460 Median : 12.717
## Mean : 15349.4 Mean : 78.687 Mean : 78.703
## 3rd Qu.: 16914.9 3rd Qu.: 74.659 3rd Qu.: 82.822
## Max. :193541.7 Max. :18293.675 Max. :4083.500
## NA's :5302 NA's :5305 NA's :6310
## total_deaths_per_million new_deaths_per_million
## Min. : 0.001 Min. :-76.445
## 1st Qu.: 9.200 1st Qu.: 0.000
## Median : 60.276 Median : 0.137
## Mean : 330.726 Mean : 1.536
## 3rd Qu.: 378.053 3rd Qu.: 1.313
## Max. :5986.714 Max. :218.329
## NA's :15744 NA's :15589
## new_deaths_smoothed_per_million reproduction_rate icu_patients
## Min. :-10.921 Min. :-0.01 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0.84 1st Qu.: 29.0
## Median : 0.171 Median : 1.01 Median : 141.0
## Mean : 1.391 Mean : 1.01 Mean : 960.1
## 3rd Qu.: 1.288 3rd Qu.: 1.18 3rd Qu.: 599.0
## Max. : 63.140 Max. : 5.91 Max. :28889.0
## NA's :6310 NA's :21270 NA's :98139
## icu_patients_per_million hosp_patients hosp_patients_per_million
## Min. : 0.00 Min. : 0 Min. : 0.00
## 1st Qu.: 3.80 1st Qu.: 107 1st Qu.: 20.40
## Median : 13.54 Median : 558 Median : 70.89
## Mean : 23.90 Mean : 4268 Mean : 155.42
## 3rd Qu.: 37.54 3rd Qu.: 2407 3rd Qu.: 218.32
## Max. :192.55 Max. :133214 Max. :1532.57
## NA's :98139 NA's :95760 NA's :95760
## weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 6.93 1st Qu.: 1.57 1st Qu.: 36.01
## Median : 33.36 Median : 6.87 Median : 245.31
## Mean : 239.87 Mean : 18.43 Mean : 2969.81
## 3rd Qu.: 180.50 3rd Qu.: 20.27 3rd Qu.: 1199.00
## Max. :4002.46 Max. :279.22 Max. :116323.00
## NA's :108934 NA's :108934 NA's :107958
## weekly_hosp_admissions_per_million new_tests total_tests
## Min. : 0.00 Min. :-2901036 Min. : 0
## 1st Qu.: 7.53 1st Qu.: 1784 1st Qu.: 187827
## Median : 30.42 Median : 6591 Median : 955708
## Mean : 90.86 Mean : 50225 Mean : 8926626
## 3rd Qu.: 101.99 3rd Qu.: 25350 3rd Qu.: 3953262
## Max. :2656.91 Max. : 3740296 Max. :499128115
## NA's :107958 NA's :61542 NA's :61830
## total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
## Min. : 0.00 Min. :-53.32 Min. : 0
## 1st Qu.: 16.46 1st Qu.: 0.16 1st Qu.: 1883
## Median : 82.39 Median : 0.66 Median : 7087
## Mean : 371.17 Mean : 2.27 Mean : 47435
## 3rd Qu.: 345.03 3rd Qu.: 2.06 3rd Qu.: 28163
## Max. :12153.52 Max. :327.09 Max. :3080396
## NA's :61830 NA's :61542 NA's :53096
## new_tests_smoothed_per_thousand positive_rate tests_per_case
## Min. : 0.00 Min. :0.00 Min. : 1.1
## 1st Qu.: 0.16 1st Qu.:0.02 1st Qu.: 7.8
## Median : 0.70 Median :0.05 Median : 19.2
## Mean : 2.17 Mean :0.09 Mean : 161.2
## 3rd Qu.: 2.12 3rd Qu.:0.13 3rd Qu.: 57.7
## Max. :90.85 Max. :0.93 Max. :50000.0
## NA's :53096 NA's :56757 NA's :57389
## tests_units total_vaccinations people_vaccinated
## Length:109972 Min. :0.000e+00 Min. :0.000e+00
## Class :character 1st Qu.:1.736e+05 1st Qu.:1.429e+05
## Mode :character Median :1.348e+06 Median :9.315e+05
## Mean :4.841e+07 Mean :2.585e+07
## 3rd Qu.:8.607e+06 3rd Qu.:5.312e+06
## Max. :4.721e+09 Max. :2.446e+09
## NA's :88150 NA's :89076
## people_fully_vaccinated total_boosters new_vaccinations
## Min. :1.000e+00 Min. : 10 Min. : 0
## 1st Qu.:6.988e+04 1st Qu.: 990 1st Qu.: 5213
## Median :5.834e+05 Median : 2767 Median : 30445
## Mean :1.511e+07 Mean : 198473 Mean : 780013
## 3rd Qu.:3.656e+06 3rd Qu.: 186094 3rd Qu.: 170982
## Max. :1.840e+09 Max. :1669265 Max. :56134025
## NA's :92064 NA's :109791 NA's :91851
## new_vaccinations_smoothed total_vaccinations_per_hundred
## Min. : 0 Min. : 0.00
## 1st Qu.: 893 1st Qu.: 3.51
## Median : 7455 Median : 17.78
## Mean : 373313 Mean : 33.79
## 3rd Qu.: 50840 3rd Qu.: 53.17
## Max. :43413743 Max. :233.10
## NA's :71527 NA's :88150
## people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.81 1st Qu.: 1.58
## Median : 12.44 Median : 7.04
## Mean : 21.07 Mean : 14.78
## 3rd Qu.: 35.05 3rd Qu.: 22.14
## Max. :116.85 Max. :116.26
## NA's :89076 NA's :92064
## total_boosters_per_hundred new_vaccinations_smoothed_per_million
## Min. : 0.00 Min. : 0
## 1st Qu.: 0.00 1st Qu.: 446
## Median : 0.00 Median : 2003
## Mean : 0.55 Mean : 3491
## 3rd Qu.: 0.04 3rd Qu.: 5133
## Max. :11.13 Max. :118759
## NA's :109791 NA's :71527
## stringency_index population population_density median_age
## Min. : 0.00 Min. :4.700e+01 Min. : 0.137 Min. :15.10
## 1st Qu.: 43.52 1st Qu.:2.083e+06 1st Qu.: 36.253 1st Qu.:22.20
## Median : 59.26 Median :9.660e+06 Median : 83.479 Median :29.70
## Mean : 57.69 Mean :1.231e+08 Mean : 391.932 Mean :30.54
## 3rd Qu.: 73.61 3rd Qu.:3.347e+07 3rd Qu.: 209.588 3rd Qu.:39.10
## Max. :100.00 Max. :7.795e+09 Max. :20546.766 Max. :48.20
## NA's :18759 NA's :771 NA's :8057 NA's :12230
## aged_65_older aged_70_older gdp_per_capita extreme_poverty
## Min. : 1.144 Min. : 0.526 Min. : 661.2 Min. : 0.10
## 1st Qu.: 3.466 1st Qu.: 2.063 1st Qu.: 4466.5 1st Qu.: 0.60
## Median : 6.378 Median : 3.871 Median : 12951.8 Median : 2.20
## Mean : 8.776 Mean : 5.555 Mean : 19276.9 Mean :13.45
## 3rd Qu.:14.312 3rd Qu.: 8.678 3rd Qu.: 27216.4 3rd Qu.:21.20
## Max. :27.049 Max. :18.493 Max. :116935.6 Max. :77.60
## NA's :13322 NA's :12768 NA's :11805 NA's :43915
## cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers
## Min. : 79.37 Min. : 0.990 Min. : 0.10 Min. : 7.70
## 1st Qu.:168.71 1st Qu.: 5.310 1st Qu.: 1.90 1st Qu.:21.60
## Median :242.65 Median : 7.110 Median : 6.30 Median :31.40
## Mean :258.89 Mean : 7.963 Mean :10.58 Mean :32.72
## 3rd Qu.:329.94 3rd Qu.:10.080 3rd Qu.:19.30 3rd Qu.:41.10
## Max. :724.42 Max. :30.530 Max. :44.00 Max. :78.10
## NA's :11890 NA's :9173 NA's :33345 NA's :34463
## handwashing_facilities hospital_beds_per_thousand life_expectancy
## Min. : 1.19 Min. : 0.100 Min. :53.28
## 1st Qu.: 19.35 1st Qu.: 1.300 1st Qu.:67.92
## Median : 49.84 Median : 2.400 Median :74.62
## Mean : 50.78 Mean : 3.026 Mean :73.25
## 3rd Qu.: 83.24 3rd Qu.: 3.861 3rd Qu.:78.74
## Max. :100.00 Max. :13.800 Max. :86.75
## NA's :60704 NA's :20733 NA's :5644
# find extreme rows
covid_data %>% arrange(desc(total_cases))## # A tibble: 109,972 x 60
## iso_code continent location date total_cases new_cases new_cases_smoot~
## <chr> <fct> <chr> <date> <dbl> <dbl> <dbl>
## 1 OWID_WRL <NA> World 2021-08-16 207834266 566906 635151.
## 2 OWID_WRL <NA> World 2021-08-15 207267360 459851 648444.
## 3 OWID_WRL <NA> World 2021-08-14 206807509 526728 645223.
## 4 OWID_WRL <NA> World 2021-08-13 206280781 866643 647795.
## 5 OWID_WRL <NA> World 2021-08-12 205414138 684874 641319
## 6 OWID_WRL <NA> World 2021-08-11 204729264 706957 641757
## 7 OWID_WRL <NA> World 2021-08-10 204022307 634095 636638.
## 8 OWID_WRL <NA> World 2021-08-09 203388212 659957 637593.
## 9 OWID_WRL <NA> World 2021-08-08 202728255 437309 627202.
## 10 OWID_WRL <NA> World 2021-08-07 202290946 544732 630876.
## # ... with 109,962 more rows, and 53 more variables: total_deaths <dbl>,
## # new_deaths <dbl>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## # new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## # total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## # new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## # icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## # hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>,
## # weekly_icu_admissions_per_million <dbl>, weekly_hosp_admissions <dbl>,
## # weekly_hosp_admissions_per_million <dbl>, new_tests <dbl>,
## # total_tests <dbl>, total_tests_per_thousand <dbl>,
## # new_tests_per_thousand <dbl>, new_tests_smoothed <dbl>,
## # new_tests_smoothed_per_thousand <dbl>, positive_rate <dbl>,
## # tests_per_case <dbl>, tests_units <chr>, total_vaccinations <dbl>,
## # people_vaccinated <dbl>, people_fully_vaccinated <dbl>,
## # total_boosters <dbl>, new_vaccinations <dbl>,
## # new_vaccinations_smoothed <dbl>, total_vaccinations_per_hundred <dbl>,
## # people_vaccinated_per_hundred <dbl>,
## # people_fully_vaccinated_per_hundred <dbl>,
## # total_boosters_per_hundred <dbl>,
## # new_vaccinations_smoothed_per_million <dbl>, stringency_index <dbl>,
## # population <dbl>, population_density <dbl>, median_age <dbl>,
## # aged_65_older <dbl>, aged_70_older <dbl>, gdp_per_capita <dbl>,
## # extreme_poverty <dbl>, cardiovasc_death_rate <dbl>,
## # diabetes_prevalence <dbl>, female_smokers <dbl>, male_smokers <dbl>,
## # handwashing_facilities <dbl>, hospital_beds_per_thousand <dbl>,
## # life_expectancy <dbl>
What are the metadata columns that describe our observations?
continent
location
date
Why do we have observations with the continent as NA?
# check which location have continent as NA
covid_data %>% filter(is.na(continent)) %>% count(location)## # A tibble: 9 x 2
## location n
## * <chr> <int>
## 1 Africa 551
## 2 Asia 573
## 3 Europe 572
## 4 European Union 572
## 5 International 557
## 6 North America 573
## 7 Oceania 570
## 8 South America 542
## 9 World 573
Some rows contain summarised data of entire continents/World, we'll need to remove those
We can see that most of our data contains '0' (check the difference between the median and the mean in total_cases and total_deaths columns). Just to confirm that, let's plot a histogram of all the confirmed cases
ggplot(covid_data, aes(x=total_cases)) +
geom_histogram(fill="lightskyblue") +
theme_bw(def_text_size)The data is evolving over days (a time-series), to there's no point treating it as a random population sample.
Let's look at confirmed cases and total deaths data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:
filter(!is.na(continent)group_by()arrange(desc(date))slice(1) and remove grouping with ungroup()inner_join()Optional step:
location variable as a factor and order it so the countries will be ordered in the legend by the number of cases withThen we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.
# find the 10 most affected countries (to date)
latest_data <- covid_data %>% filter(!is.na(continent)) %>%
group_by(location) %>% arrange(desc(date)) %>% slice(1) %>% ungroup()
most_affected_countries <- latest_data %>%
arrange(desc(total_deaths)) %>% slice(1:10) %>%
select(location)
# subset just the data from the 10 most affected countries and order them from the most affected to the least one
most_affected_data <- covid_data %>%
inner_join(most_affected_countries) %>%
mutate(Country=factor(location, levels = most_affected_countries$location))
# create a line plot the data of total cases
ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Country", y = "Total COVID-19 cases") +
theme_bw(def_text_size)It's a bit hard to figure out how the pandemic evolved because the numbers in US, Brazil and India are an order of magnitude larger than the rest (which are very close to each other). How can we make it more visible (and also improve how of the dates appear in the x-axis)?
# better formatting of date axis, log scale
plot <- ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
geom_line(size=0.75) + scale_y_log10(labels=comma) +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Country", y = "Total COVID-19 cases") +
theme_bw(def_text_size)
# show an interactive plot
ggplotly(plot)Why did we get a warning message and why the graphs don't start at the bottom of the x-axis? How can we solve it? What can we infer from the graph (exponential increase)?
What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function (or add a very small number to them)?
We can see a very similar trend for most countries and while the curve has flattened substantially in April last year, the numbers are still rising. It is also evident that Europe got hit by a second wave arount October last year and India in April this year.
Let's have a look at the total deaths in these countries (and get rid of the minor grid lines to make Frank happy)
# create a line plot the data of total deaths
ggplot(most_affected_data, aes(x=date, y=total_deaths, colour=Country)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
scale_color_paletteer_d("rcartocolor::Bold") +
labs(color="Country", y = "Total deaths") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank()) # remove minor grid linesLet's have a look at the number of vaccinated people.
# vaccination rates
ggplot(most_affected_data, aes(x=date, y=people_vaccinated, colour=Country)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d("rcartocolor::Bold") +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())The graphs are "broken", meaning that it is not continuous and we have some missing data.
Let's visualise some of the variables in our data and assess "missingness".
# visualise missingness
vis_dat(covid_data %>% filter(date>dmy("01-01-2021")) %>%
select(continent, location, total_cases, total_deaths,
hosp_patients, people_vaccinated, people_fully_vaccinated))# find which countries has the most number of observations (least missing data)
covid_data %>% filter(!is.na(continent), !is.na(people_vaccinated)) %>% # group_by(location) %>%
count(location) %>% arrange(desc(n)) %>% print(n=30)## # A tibble: 218 x 2
## location n
## <chr> <int>
## 1 Norway 254
## 2 Canada 245
## 3 Israel 240
## 4 Latvia 240
## 5 Denmark 239
## 6 Liechtenstein 234
## 7 Switzerland 234
## 8 Chile 233
## 9 Austria 232
## 10 Czechia 232
## 11 Estonia 232
## 12 Germany 232
## 13 Italy 232
## 14 Lithuania 232
## 15 Slovenia 232
## 16 Bahrain 231
## 17 Belgium 231
## 18 France 229
## 19 Romania 224
## 20 Slovakia 224
## 21 United States 222
## 22 United Kingdom 221
## 23 Portugal 216
## 24 Greece 212
## 25 Bulgaria 211
## 26 Malta 209
## 27 Mexico 209
## 28 Argentina 203
## 29 India 202
## 30 Poland 195
## # ... with 188 more rows
covid_data %>% filter(!is.na(continent), !is.na(hosp_patients)) %>% # group_by(location) %>%
count(location) %>% arrange(desc(n)) %>% print(n=30)## # A tibble: 30 x 2
## location n
## <chr> <int>
## 1 France 539
## 2 Estonia 533
## 3 Israel 533
## 4 Italy 532
## 5 Portugal 530
## 6 Sweden 529
## 7 Czechia 526
## 8 Canada 525
## 9 Netherlands 524
## 10 Hungary 522
## 11 Cyprus 518
## 12 Ireland 517
## 13 Slovenia 517
## 14 Belgium 512
## 15 Luxembourg 511
## 16 Switzerland 504
## 17 United Kingdom 504
## 18 Latvia 496
## 19 Austria 495
## 20 Bulgaria 490
## 21 Croatia 481
## 22 Poland 474
## 23 Slovakia 466
## 24 Norway 460
## 25 Denmark 420
## 26 United States 397
## 27 Iceland 365
## 28 Finland 329
## 29 Spain 233
## 30 Lithuania 230
Now we can focus on a subset of countries that have more complete vaccination and hospitalisation rates data, so we could compare them.
countries_subset <- c("Italy", "United States", "Israel", "United Kingdom", "France", "Czechia")
# subset our original data to these countries
hosp_data <- covid_data %>% filter(location %in% countries_subset)
# define a new colour palette for these countries
col_pal <- "ggsci::category10_d3"Let's look at hospitalisation rates first.
ggplot(hosp_data, aes(x=date, y = hosp_patients,colour=location)) +
geom_line(size=0.75) + scale_y_continuous(labels=comma) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Hospitalised patients") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())Can you identify the "waves" in each country?
It's hard to see the details in the countries with lower number of hospitalised patients, how can we improve the visualisation?
Look at hospitalision rates proportional to the population size!
# hosp per population size
p1 <- ggplot(hosp_data,
aes(x=date, y = hosp_patients_per_million,colour=location)) +
geom_line(size=0.75) +
scale_y_continuous(labels=comma) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Hospitalised patients (per million)") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
p1Now let's try to compare this to vaccination rates
# total vaccination per population
p2 <- ggplot(hosp_data,
aes(x=date, y = people_fully_vaccinated_per_hundred ,colour=location)) +
geom_line(size=0.75, linetype="dashed") +
guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
scale_y_continuous(labels=comma) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Fully vaccinated (per hundred)") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
p2What will be the best way to compare these values?
(p1 + guides(color=FALSE))+ p2 + plot_layout(guides = 'collect')# show graphs side by sideMaybe like this:
(p1 + guides(color=FALSE)) / (p2 + theme(legend.position = "bottom")) #+ plot_layout(guides = 'collect')# show graphs on top of each otherAny suggestions?
There's a lot of empty "real estate" in the vaccination graph, maybe we could trim off 2020?
p3 <- ggplot(hosp_data,
aes(x=date, y = people_fully_vaccinated_per_hundred ,colour=location)) +
geom_line(size=0.75, linetype="dashed") +
guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
scale_y_continuous(labels=comma) +
scale_color_paletteer_d(col_pal) +
scale_x_date(name = NULL,
limits = c(dmy("01-01-2021"), NA),
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country", y = "Fully vaccinated (per hundred)") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
(p1 + guides(color=FALSE)) + p3 + plot_layout(guides = 'collect', widths = c(2, 1)) # maybe like this?Let's try to present them on the same graph (note the trick with the secondary y-axis).
# show on the same graph
p4 <- ggplot(hosp_data,
aes(x=date, colour=location)) +
geom_line(aes(y = hosp_patients_per_million), size=0.75) +
geom_line(aes(y = people_fully_vaccinated_per_hundred*10), size=0.75, linetype="dashed") +
scale_y_continuous(labels=comma, name = "Hospitalised patients per million (solid)",
# Add a second axis and specify its features
sec.axis = sec_axis(trans=~./10, name="Fully vaccinated per hundred (dashed)")) +
scale_color_paletteer_d(col_pal) +
scale_x_date(NULL,
breaks = breaks_width("2 months"),
labels = label_date_short()) +
labs(color="Country") +
theme_bw(def_text_size) +
theme(panel.grid.minor = element_blank())
p4Probably best to present them on the same graph (note the trick with the secondary y-axis), but for each country separately (done with the facet_wrap() function).
# show on the same graph, but separate each country
p4 +
scale_x_date(NULL,
breaks = breaks_width("4 months"),
labels = label_date_short()) +
facet_wrap(~location)Anything to worry about???
Save the plot to a folder.
# create output folder
dir.create("./output", showWarnings = FALSE)
# save the plot to pdf file
ggsave("output/hospit_vacc_rates_facet_country.pdf", width=14, height = 8)1. Mortalities (Case Fatality Rate)?
2. Suggestions? (cases per population density, vaccination rates by country income, deaths by number of beds per capita, etc.
3. Level of reporting in each country...
Please contact me at i.bar@griffith.edu.au for any questions or comments.
Mathieu E, Ritchie H, Ortiz-Ospina E, et al. (2021) A global database of COVID-19 vaccinations. Nat Hum Behav 5:947–953. doi: 10.1038/s41562-021-01122-8